Using Rate Limiters To Prevent Runaway Workflows
Learn how rate limiters can be used to prevent runaway workflows.
DevOps engineers can be responsible for a service that is made up of dozens of microservices. Those microservices, in turn, may run as anywhere from dozens to tens of thousands of instances in data centers around the globe. Once a service consists of more than a couple of instances, some form of rate control needs to exist to prevent bad rollouts or configuration changes from causing mass destruction.
Some type of rate limiter on work, with forced pause intervals, is critical to preventing runaway infrastructure changes.
Rate limiting is easy to implement, but the scope of the rate limiter is going to depend on what our workflows are doing. For services, we may only want one type of change to happen at a time or only affect some number of instances at a time.
The first type of rate limiting would prevent multiple instances of a workflow type from running at a time; for example, we might only want one satellite disk erasure to occur at a time.
The second is to limit the number of devices, services, and so on that can be affected concurrently; for example, we might only want to allow two routers in a region to be taken out for a firmware upgrade.
For rate limiters to be effective, having a single system that executes actions for a set of services can greatly streamline these efforts. This allows centralized enforcement of policies such as rate limiting.
Let's look at the simplest implementation of a rate limiter in Go using channels.
In the early days, Google did not own all the data center space it does today—they were in a lot of rented space with a large number of machines. In some places, however, this was prohibitively expensive. To speed up connectivity in these places, Google would rent small spaces that could have cache machines, terminate HTTP connections, and backhaul the traffic to a data center. Google called these satellites.
Google has an automated process for the decommissioning of machines. One part of this is called disk erase, whereby the machines have their disks wiped.
The software was written to grab a list of machines for a satellite and filter out other machines. Unfortunately, if run twice on a satellite, the filter was not applied, and the list of machines became all machines in every satellite.
Disk erase was very efficient, putting all machines in all satellites in disk erase at once before anything could be done.
For a more detailed breakdown, several Google Site Reliability Engineers (SREs) have provided more detail in the context of a postmortem.
We can look at the filtering part of the code and discuss bad design, but there will always be badly written tools with bad inputs. Even if we currently have a good culture for code reviews, things slip by. During times of hypergrowth with new engineers, these types of problems can rear their ugly heads.
Some tools that are known to be dangerous in the hands of a small group of experienced engineers can be used quite safely, but new engineers without experience or ones lacking proper fear can quickly devastate our infrastructure.
In this case and many other cases, centralized execution with rate limiting and other mandatory safety mechanisms allow new people to write tools that may be dangerous but limited in their blast radius.
Channel-based rate limiter#
A channel-based rate limiter is useful when a single program is handling the automation. In that case, we can make a limiter that is based on the size of a channel. Let's make a limiter that allows only a fixed number of items to be worked on at a time, as follows:
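The course's runnable snippet isn't reproduced in this text, but a minimal sketch of such a limiter might look like the following. A buffered channel's capacity caps how many items can be in flight at once; `newLimiter` is an illustrative name, not the course's API:

```go
package main

import "fmt"

// newLimiter returns a buffered channel whose capacity is the maximum
// number of items that can be worked on at one time.
func newLimiter(n int) chan struct{} {
	return make(chan struct{}, n)
}

func main() {
	limit := newLimiter(3)

	// Acquire a slot; this blocks once all 3 slots are taken.
	limit <- struct{}{}
	limit <- struct{}{}

	// Release a slot so another worker can proceed.
	<-limit

	fmt.Println(cap(limit), len(limit)) // prints: 3 1
}
```

Sending into the channel acquires a slot, and receiving from it releases one; when the buffer is full, the next send blocks until some other goroutine releases.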
We now have something that can limit the number of items that can be worked on.
Let's define a simple type that represents some action to be executed, as follows:
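The definition itself isn't shown in this text. A plausible sketch follows, with a stand-in `pbJob` struct in place of the real protobuf-generated `pb.Job` (whose fields aren't shown), and a method set assumed from the description below:

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// pbJob is an illustrative stand-in for the protobuf-generated pb.Job
// definition used in the course; the real message type isn't shown.
type pbJob struct {
	Name string
	Args map[string]string
}

// Job describes an action the workflow system can execute.
type Job interface {
	// Validate checks a job definition before anything is executed.
	Validate(job *pbJob) error
	// Run executes the job with that definition.
	Run(ctx context.Context, job *pbJob) error
}

// noopJob is a trivial Job implementation used to show the contract.
type noopJob struct{}

func (noopJob) Validate(job *pbJob) error {
	if job.Name == "" {
		return errors.New("job must have a name")
	}
	return nil
}

func (noopJob) Run(ctx context.Context, job *pbJob) error {
	fmt.Println("running", job.Name)
	return nil
}

func main() {
	var j Job = noopJob{}
	def := &pbJob{Name: "diskErase"}
	if err := j.Validate(def); err != nil {
		panic(err)
	}
	if err := j.Run(context.Background(), def); err != nil {
		panic(err)
	}
}
```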
This defines a Job that can do the following:
- Validate a pb.Job definition passed to us.
- Run the job with that definition.
Here is a very simplistic example of executing a set of jobs contained in something called a block, which is just a holder of a slice of jobs:
In the preceding code snippet, the following happens:
- Line 3: We loop through a slice of Block inside the work.Blocks variable.
- Line 5: We loop through a slice of Jobs in the block.Jobs variable.
- Line 7: If we already have req.limit items running, limit <- struct{}{} will block.
- Lines 9–15: It executes our job concurrently. When our goroutine ends, we remove an item from our limit channel.
- Line 19: We wait for all goroutines to end.
This code prevents more than req.limit items from happening at a time. If this were a server, we could make limit a variable shared by all users and prevent more than three items of work from occurring for all work that was happening in our system. Alternatively, we could have different limiters for different classes of work.
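As a sketch of that idea, per-class limiters could be as simple as a map of channels. The class names and sizes here are illustrative, not from the course's code:

```go
package main

import "fmt"

// limits holds one limiter per class of work. The classes and their
// concurrency caps here are hypothetical examples.
var limits = map[string]chan struct{}{
	"diskErase":       make(chan struct{}, 1), // one at a time
	"firmwareUpgrade": make(chan struct{}, 2), // two at a time
}

// acquire blocks until a slot for the given work class is free.
func acquire(class string) { limits[class] <- struct{}{} }

// release frees a slot for the given work class.
func release(class string) { <-limits[class] }

func main() {
	acquire("diskErase")
	fmt.Println("diskErase slots in use:", len(limits["diskErase"]))
	release("diskErase")
}
```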
A note about that job := job part: this creates a shadowed copy of job. Because the loop and the goroutines run in parallel, making a copy of the variable in the same scope as the goroutine prevents the goroutine from observing the loop variable changing underneath it. This is a common concurrency bug for new Go developers, sometimes called the for loop gotcha. (As of Go 1.22, the loop variable is scoped per iteration, so the copy is no longer strictly necessary, but the pattern remains common in older code.) The code below can be used to work through why this is necessary.
You can see a channel-based rate limiter in action in the workflow service's runJobs() function.
Token-bucket rate limiter#
Token buckets are normally used to provide burstable traffic management for services. There are several types of token buckets, the most popular being the standard token bucket and the leaky token bucket.
These are not normally deployed for an infrastructure tool, as clients tend to be internal and more predictable than external-facing services, but a useful type of token bucket can be used to provide pacing. A standard token bucket simply holds some fixed set of tokens, and those tokens are refilled at some interval.
Here's a sample one:
This preceding code snippet does the following:
- Lines 1–3: Defines a bucket type that holds our tokens.
- Lines 5–20: Defines newBucket(), which creates a new bucket instance with the following attributes:
  - size, which is the total number of tokens that can be stored.
  - incr, which is how many tokens are added at a time.
  - interval, which is how often tokens are added to the bucket.
  It also does the following:
  - Starts a goroutine that fills the bucket at each interval.
  - Only fills up to the maximum size value.
- Lines 22–29: Defines token(), which retrieves a token:
  - If no tokens are available, we wait for one.
  - If a Context is canceled, we return an error.
This is a fairly robust implementation of a standard token bucket. We may be able to achieve a faster implementation using the atomic package, but it will be more complex to do so.
A more complete implementation adds input checking and the ability to stop the goroutine created with newBucket().
If we want, we could use a token bucket to only allow execution at some rate we define. This can be used inside a job to limit how fast an individual action can happen or to only allow so many instances of a workflow to happen within some time period. We will use it in our next section to limit when a particular workflow is allowed to happen.
Our generic workflow system includes a token bucket package that follows this design.
In this lesson, we looked at how rate limiters can be used to prevent runaway workflows. We talked about Google's satellite disk erase as a case study on this type of event. We showed how channel-based rate limiters can be implemented to control concurrent operations. We talked about how a token bucket could be used to rate-limit a number of executions within a certain time period.
This lesson is also laying the foundation of how executing actions, defined as a job, will work in the workflow system example we are building.